The IMDb Scraper is fully containerized using Docker Compose, enabling one-command deployment with all dependencies included.

Architecture Overview

The Docker environment consists of four main services:
  • postgres: PostgreSQL 15 database for structured data storage
  • tor: TOR proxy for IP rotation and anonymity
  • vpn: Gluetun VPN container for geolocation changes
  • scraper: The main application container

Docker Compose Configuration

The complete orchestration is defined in docker-compose.yml:
version: "3.9"

services:
  postgres:
    image: postgres:15
    container_name: imdb_postgres
    restart: always
    environment:
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    ports:
      - "${POSTGRES_PORT}:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./sql:/docker-entrypoint-initdb.d
    networks:
      - app_net

  tor:
    image: dperson/torproxy
    container_name: tor_proxy
    restart: always
    ports:
      - "9050:9050" # SOCKS port for traffic
      - "9051:9051" # Control port for commands (IP rotation)
    command: >
      sh -c "tor --SocksPort 0.0.0.0:9050 --ControlPort 0.0.0.0:9051 --HashedControlPassword '' --CookieAuthentication 0"
    networks:
      - app_net

  vpn:
    image: qmcgaw/gluetun
    container_name: vpn
    cap_add:
      - NET_ADMIN
    environment:
      - VPN_SERVICE_PROVIDER=protonvpn
      - OPENVPN_USER=${VPN_USERNAME}
      - OPENVPN_PASSWORD=${VPN_PASSWORD}
      - SERVER_COUNTRIES=Argentina
    ports:
      - "8888:8888"
    networks:
      - vpn_net

  scraper:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: imdb_scraper
    depends_on:
      - postgres
      - tor
      - vpn
    networks:
      - app_net
      - vpn_net
    env_file:
      - .env
    volumes:
      - .:/app
    command: >
      sh -c "
        echo 'Waiting for Postgres to be ready...' &&
        while ! nc -z postgres 5432; do sleep 1; done &&
        echo 'Postgres ready.' &&

        echo 'Waiting for Tor to be ready...' &&
        while ! nc -z tor 9050; do sleep 1; done &&
        echo 'SOCKS port ready. Tor fully initialized.' &&

        python presentation/cli/run_scraper.py &&
        echo 'Scraper finished. Running queries.sql...' &&
        PGPASSWORD=$POSTGRES_PASSWORD psql -h postgres -U $POSTGRES_USER -d $POSTGRES_DB -f sql/queries.sql &&
        echo 'SQL queries executed. Keeping container active...' &&
        tail -f /dev/null
      "

volumes:
  pgdata:

networks:
  app_net:
    driver: bridge
  vpn_net:
    driver: bridge

Dockerfile Breakdown

The scraper container is built from this Dockerfile:
FROM python:3.11-slim

ENV DEBIAN_FRONTEND=noninteractive

WORKDIR /app

# Install system dependencies including postgresql-client
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    libpq-dev \
    tor \
    curl \
    netcat-openbsd \
    gnupg \
    ca-certificates \
    postgresql-client \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the project
COPY . .

# Default command
CMD ["python", "presentation/cli/run_scraper.py"]

Key Components

Base image: python:3.11-slim, a lightweight Python environment with a minimal footprint.

System dependencies:
  • gcc: Required for compiling Python packages with C extensions
  • libpq-dev: PostgreSQL development libraries for psycopg2
  • tor: TOR network client
  • netcat-openbsd: Provides nc, which the startup command uses to wait for service ports
  • postgresql-client: Provides psql, which the startup command uses to run sql/queries.sql

Python dependencies: installed from requirements.txt with --no-cache-dir, so the pip cache does not add to the image size.

Service Dependencies

The scraper container has explicit dependencies:
depends_on:
  - postgres
  - tor
  - vpn
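depends_on only controls start order; it does not wait for a service to accept connections. The startup command therefore polls with nc before running the scraper:
  • Waits for PostgreSQL on port 5432
  • Waits for the TOR SOCKS proxy on port 9050
  • Starts scraping only once both ports accept connections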

Volume Mounts

Database Persistence

pgdata:/var/lib/postgresql/data ensures PostgreSQL data survives container restarts

SQL Initialization

./sql:/docker-entrypoint-initdb.d runs the SQL scripts in ./sql when the database is first initialized, that is, when the pgdata volume is still empty

Application Code

.:/app mounts the entire project for development (can be removed in production)

Build and Run

Initial Build

docker-compose build --no-cache
This builds the scraper image from scratch, ensuring all dependencies are fresh.

Start All Services

docker-compose up
Or run in detached mode:
docker-compose up -d

View Logs

# All services
docker-compose logs -f

# Specific service
docker-compose logs -f scraper

Stop Services

docker-compose down

Rebuild After Changes

docker-compose down
docker-compose build --no-cache
docker-compose up

Port Mappings

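| Service | Internal Port | External Port | Purpose |
| --- | --- | --- | --- |
| postgres | 5432 | ${POSTGRES_PORT} | Database connections |
| tor | 9050 | 9050 | SOCKS proxy traffic |
| tor | 9051 | 9051 | Control port (IP rotation) |
| vpn | 8888 | 8888 | VPN HTTP proxy |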
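
The scraper reaches postgres and tor by service name over app_net, so the host port mappings are only needed for access from the host machine. As a minimal sketch, assuming host-only access is enough for your setup, the published ports can be bound to the loopback interface by editing the existing ports entries in docker-compose.yml:

```yaml
# Sketch only: shows just the keys that change in docker-compose.yml.
services:
  postgres:
    ports:
      # Reachable from the host only, not from other machines
      - "127.0.0.1:${POSTGRES_PORT}:5432"

  tor:
    ports:
      - "127.0.0.1:9050:9050" # SOCKS port
      - "127.0.0.1:9051:9051" # Control port
```

Containers on app_net are unaffected by this change, because they connect to the internal ports directly.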

Accessing Services

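1. PostgreSQL Database

Connect from the host machine. The command below assumes port 5432, user aruiz, and database imdb_scraper; substitute the POSTGRES_PORT, POSTGRES_USER, and POSTGRES_DB values from your .env file:
psql -h localhost -p 5432 -U aruiz -d imdb_scraper

2. Scraper Logs

Real-time logs are available in logs/scraper.log or via Docker:
docker logs -f imdb_scraper

3. Generated Data

CSV files appear in the data/ directory:
  • movies.csv
  • actors.csv
  • movie_actor.csv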

Troubleshooting

If the scraper exits immediately, check that:
  • The .env file exists with all required variables
  • PostgreSQL container is running: docker ps
  • TOR proxy is responding: docker logs tor_proxy

Common Issues

Database Connection Failed

# Check if postgres is running
docker ps | grep postgres

# View postgres logs
docker logs imdb_postgres

TOR Not Ready

# Verify TOR is listening
docker exec tor_proxy netstat -tuln | grep 9050

VPN Connection Issues

# Check VPN status
docker logs vpn

Production Considerations

Remove Volume Mount

Replace the - .:/app bind mount with mounts for only the files the container needs, rather than the entire codebase
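
One way this could look is sketched below. It assumes the scraper writes its output to data/ and logs/ relative to /app, as the paths elsewhere on this page suggest; the application code then comes from the image itself (the Dockerfile's COPY . . step) instead of the host.

```yaml
# Sketch only: replaces the volumes list of the scraper service.
services:
  scraper:
    volumes:
      # Assumed output locations; adjust if the scraper writes elsewhere
      - ./data:/app/data
      - ./logs:/app/logs
```

With this layout, code changes require rebuilding the image before they take effect.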

Use Secrets

Replace .env file with Docker secrets or external secret management
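
As a hedged illustration for the database password, the official postgres image can read it from a file via POSTGRES_PASSWORD_FILE, which pairs with a file-based Compose secret. The secret name and file path below are hypothetical; the scraper and the psql step would also need to read the password from the secret, which is not shown here.

```yaml
# Sketch only: the secret name and ./secrets path are hypothetical.
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: ${POSTGRES_DB}
      POSTGRES_USER: ${POSTGRES_USER}
      # Read the password from the mounted secret instead of an env value
      POSTGRES_PASSWORD_FILE: /run/secrets/postgres_password
    secrets:
      - postgres_password

secrets:
  postgres_password:
    file: ./secrets/postgres_password.txt
```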

Health Checks

Add explicit healthcheck directives to docker-compose.yml
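
A possible shape for this is sketched below: a healthcheck on postgres using pg_isready (shipped in the postgres image), with the scraper's depends_on switched to the long form so it waits for that check. This assumes a Docker Compose release that implements the Compose Specification; the interval, timeout, and retries values are illustrative and not taken from the project.

```yaml
# Sketch only: shows just the keys that are added or changed.
services:
  postgres:
    healthcheck:
      # $$ escapes Compose interpolation so the container shell expands the variables
      test: ["CMD-SHELL", "pg_isready -U $${POSTGRES_USER} -d $${POSTGRES_DB}"]
      interval: 5s
      timeout: 3s
      retries: 10

  scraper:
    depends_on:
      postgres:
        condition: service_healthy # wait until pg_isready succeeds
      tor:
        condition: service_started
      vpn:
        condition: service_started
```

Once a healthcheck is defined, docker ps reports the container's health status alongside its uptime.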

Resource Limits

Set memory and CPU limits for each service
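
A minimal sketch using deploy.resources.limits follows. The numbers are placeholders rather than measured requirements, and it assumes a Docker Compose release that applies deploy limits outside Swarm (the legacy docker-compose v1 tool needs the --compatibility flag for this).

```yaml
# Sketch only: limit values are placeholders to tune per workload.
services:
  scraper:
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 512M

  postgres:
    deploy:
      resources:
        limits:
          cpus: "1.0"
          memory: 1G
```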

Next Steps

Environment Variables

Configure database, proxy, and VPN credentials

Network Configuration

Understand Docker networks and proxy setup